\section{Experimental results}\label{sec:results}
\subsection{Testbed}
All of our performance results use a hardware testbed that consists of
two Intel Xeon E5 2440 v2 1.9GHz servers, each with 1 CPU socket and
8 physical cores with hyperthreading enabled.
Each target server has an Intel 10GbE X540-AT2 dual
port NIC, with the two ports of the Intel NIC on one server connected
to the two ports on the identical NIC on the other server.
We installed p4c-xdp on one server, the {\em target server}, and
attached the XDP program to the port that receives the packets
The other server, the {\em source server}, generates packets
at the maximum 10~Gbps packet rate of 14.88~Mpps using the DPDK-based
TRex~\cite{trex} traffic generator.  The source server sends minimum
length 64-byte packets in a {\em single} UDP flow to one port of the
target server, and receives the forwarded packets on the same port.
At the target server, we use only one core to process all packets.
Every packet received goes through the pipeline specified in P4.

We use the sample p4 programs in the tests directory and the following
metrics to understand the performance impact of the P4-generated XDP
program:
\begin{itemize}
\item Packet Processing Rate (Mpps): Once the XDP program finishes
  processing the packet, it returns one of the actions mentioned in
  section~\ref{sec:background}.  When we want to count the number of
  packets that can be dropped per second, we modify each p4 program to
  always return XDP\_DROP.
\item CPU Utilization: Every packet processed by the XDP program is run
  under the per-core software IRQ daemon, named
  \texttt{ksoftirqd/\textit{core}}.  All packets are processed by only
  one core with one kthread, the ksoftirqd, and we measure the CPU
  utilization of the ksoftirqd on the core.
\item Number of BPF instructions verified: For each program, we list
  the complexity as the number of BPF instructions the eBPF
  verifier scans.
\end{itemize}

The target server is running Linux kernel 4.19-rc5 and for all our
tests, the BPF JIT (Just-In-Time) compiler is enabled and JIT hardening
is disabled. All programs are compiled with clang 3.8 with llvm 5.0.
For each test program, we use the following
command from iproute2 to load it into kernel:
\begin{lstlisting}[frame=none]
ip link set dev eth0 xdp obj xdp1.o verb
\end{lstlisting}

The Intel 10GbE X540 NIC is running the \texttt{ixgbe} driver with 16 RX queues
set-up. Since the source server is sending single UDP flow, packets
always arrive at a single queue ID.  As a result, we collect the number
of packets being dropped at this queue.

\subsection{Results}

To compute the baseline performance we wrote two small XDP programs by
hand: \texttt{SimpleDrop}, drops all packets by returning
\texttt{XDP\_DROP} immediately.  \texttt{SimpleTX} forwards the packet
to the receiving port returning \texttt{XDP\_TX}.  Each of these
programs consists of only two BPF instructions.

\begin{lstlisting}[frame=none]
    /* SimpleDrop */
    0: (b7) r0 = 1 // XDP_DROP
    1: (95) exit

    /* SimpleTX */
    0: (b7) r0 = 3 // XDP_TX
    1: (95) exit
\end{lstlisting}

After, we attached the following P4 programs to the receiving device:
\begin{itemize}
\item xdp1.p4: Parse Ethernet/IPv4 header, deparse it, and drop.
\item xdp3.p4: Parse Ethernet/IPv4 header, lookup a MAC address
in a map, deparse it, and drop.
\item xdp6.p4: Parse Ethernet/IPv4 header, lookup and get a new TTL value
from eBPF map, set to IPv4 header, deparse it, and drop.
\item xdp7.p4: Parse Ethernet/IPv4/UDP header, write a pre-defined source port
and source IP, recalculate checksum, deparse, and drop.
\item xdp11.p4: Parse Ethernet/IPv4 header, swap src/dst MAC address,
deparse it, and send back to the same port (XDP\_TX).
\item xdp15.p4: Parse Ethernet header, insert a customized 8-byte header,
deparse it, and send back to the same port (XDP\_TX).
\end{itemize}

\begin{table}
\centering
\small
\begin{tabular}{llll}
  \underline{P4 program} & \underline{CPU Util.} & \underline{Mpps} & \underline{Insns./Stack}\\
  SimpleDrop & 75\% & 14.4 & 2/0 \\
  SimpleTX & 100\% & 7.2 & 2/0 \\
  xdp1.p4 &  100\% &  8.1 & 277/256 \\
  xdp3.p4 &  100\% &  7.1 & 326/256 \\
  xdp6.p4 &  100\% &  2.5 & 335/272 \\
  xdp7.p4 &  100\% &  5.7 & 5821/336 \\
  xdp11.p4 &  100\% &  4.7  & 335/216 \\
  xdp15.p4 &  100\% &  5.5 & 96/56\\
\end{tabular}
\caption{\footnotesize Performance of XDP program generated by
  p4c-xdp compiler using single core.}
\label{tab:perf}
\end{table}

As shown in Table~\ref{tab:perf}, xdp1.p4 allows us to measure the
overhead introduced by parsing and deparsing: a drop from 14.4~Mpps to
8.1~Mpps.  xdp3.p4 reduces the rate by another million PPS due to the
eBPF map lookup (this operation always returns NULL, no value from the
map is accessed).  xdp6.p4 has significant overhead because it
accesses a map, finds a new TTL value, and writes to the IPv4 header.
Interestingly, although xdp7.p4 does extra parsing to the UDP header
and checksum recalculation, it has only a moderate overhead because of
the lack of map accesses.

Finally, xdp11.p4 and xdp15.p4 show the transmit (XDP\_TX)
performance.  Compared with xdp11, xdp15.p4 invokes the
\texttt{bpf\_adjust\_head} helper function to reset the pointer for extra
bytes.  It does not incur much overhead because there
is already a reserved space in front of every XDP packet frame.

\subsection{Performance Analysis}

To further understand the performance overhead of programs generated
by p4c-xdp, we started broke down the CPU utilization. We used the Linux
perf tool on the process ID of the ksoftirqd that shows 100\%:
\begin{lstlisting}[frame=none]
perf record -p <pid of ksoftirqd> sleep 10
\end{lstlisting}


\noindent The following output shows the profile of xdp1.p4:
{\scriptsize
\begin{verbatim}
 83.19% [kernel.kallsyms] [k] ___bpf_prog_run
 8.14%  [ixgbe]           [k] ixgbe_clean_rx_irq
 4.82%  [kernel.kallsyms] [k] nmi
 1.48%  [kernel.kallsyms] [k] bpf_xdp_adjust_head
 1.07%  [kernel.kallsyms] [k] __rcu_read_unlock
 0.40%  [ixgbe]           [k] ixgbe_alloc_rx_buffers
\end{verbatim}
}

This confirms that most of the CPU cycles are spent on executing the
XDP program, \texttt{\_\_\_bpf\_prog\_run}, which caused us to investigate the
eBPF C code of xdp1.p4.

\begin{table}
\centering
\small
\begin{tabular}{llll}
  \underline{P4 program} & \underline{CPU Util.} & \underline{Mpps} & \underline{Insns./Stack}\\
  xdp1.p4 &  77\% &  14.8 & 26/0 \\
  xdp3.p4 &  100\% &  13 & 100/16 \\
  xdp6.p4 &  100\% &  12 & 98/40 \\
\end{tabular}
\caption{\footnotesize Performance of XDP program without deparser.}
\label{tab:perf2}
\end{table}

After commenting out the deparser C code, performance increases
significantly (see Table~\ref{tab:perf2}).  In the generated
code, the p4c-xdp compiler always writes back the entire packet
content, even when the P4 program does not modify any fields.  In
addition, the parser/deparser incur byte-order translation, e.g.,
htonl, ntohl.  This could be avoided by always using network
byte-order in P4 and XDP.  We plan to implement optimizations to
reduce this overhead.
